I suggest running the following code so that your numbers are not displayed in scientific notation:
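A one-line way to do this in R is to raise the penalty R applies to scientific notation:

```r
# A large scipen value makes R strongly prefer fixed notation over 1e+05-style output
options(scipen = 999)

format(1e6)  # "1000000" instead of "1e+06"
```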
For this project, I used four data sets. The first one has details about coronavirus cases in the USA by county, which would be the equivalent of NUTS 3 regions. This comes from the New York Times' GitHub.
The second one has socio-economic and demographic data about the US by county, which was taken from the Centers for Disease Control and Prevention.
The third data set has information about the number and characteristics of each airport by county, which I took from the Federal Aviation Administration.
I cleaned the second and third data sets in Excel and joined them into one.
Creating Interactive Data Visualizations
Based on a lecture I recently had on R, Python and Julia, as well as some reading online, I have come to the conclusion that base R graphics and packages like ggplot2 suit two different occasions.
Base R is recommended when you want to ensure the longevity of your code: because base R remains stable, you can be confident your code will keep functioning on any computer for years to come.
One of the disadvantages of base R is, of course, that you won't get the same level of creativity as you do with R packages.
On the flip side, packages are recommended when you're not so concerned about longevity as about the breadth of what you can accomplish. If you have a one-off project like this one, or need to create a one-off report, R packages can offer you a lot more creativity.
The disadvantage of using packages is that your code will be subject to package updates, or even to packages becoming obsolete in the future.
Creating Interactive Graphs
To make the following interactive graphs and maps, I am using two packages:
If you want to see some basic interactive charts, which were the inspiration for the following ones, check out R Graph Gallery's section on interactive charts.
Here, you can see that the number of unique counties reporting at least one coronavirus case has steadily grown. However, as of the latest date, this growth has finally slowed down.
In this graph, you can see the cumulative number of cases reported. As of April 13th, this number is still growing and the United States currently has the most COVID-19 cases in the world.
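As a minimal sketch of how an interactive line chart like this can be built with plotly (the column names and values here are made up for illustration; the real series comes from the NYT county data):

```r
library(plotly)

# Hypothetical daily totals standing in for the real cumulative series
daily <- data.frame(
  date  = as.Date("2020-03-01") + 0:4,
  total = c(30, 70, 120, 210, 340)
)

fig <- plot_ly(daily, x = ~date, y = ~total,
               type = "scatter", mode = "lines+markers") %>%
  layout(title = "Cumulative reported cases",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Cases"))
fig
```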
Creating Interactive Maps
Creating maps in R can be done both in base R and by using packages. Choosing to use packages can add a level of interactivity that helps your users better understand and explore your data. For the following interactive maps, I continued to use the plotly package. Some other packages you can use to create widgets and interactive maps include:
- leaflet
- tmap
- shiny
- mapview
If you want to create a basic interactive map, you can start with R Graph Gallery’s section on interactive maps.
If you want to see some of the crazy details you can add to your visualizations, check out Geocomputation with R's chapter on making maps in R.
The basics of creating any map include two vital elements:
- Data with geospatial information: in my case, I am using two-letter state codes and FIPS county codes
- A geospatial file such as a GeoJSON: in my case, I downloaded it from plotly's GitHub
The geospatial file contains the information needed to draw your map, such as geometry, latitudes and longitudes. It is important that your data include some sort of geospatial identifier, such as country or regional codes, because that identifier acts as a bridge between your data file and your geospatial file.
In the following map, you can see the reported cumulative cases for each state. As you can see in the legend, the brighter the color, the higher the number of cumulative cases a state has reported.
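A minimal sketch of a state-level choropleth with plotly might look like this (only two example states are included, with their values taken from the table further down; two-letter state codes act as the bridge to plotly's built-in US geometry):

```r
library(plotly)

# Hypothetical two-row data frame; the real data covers every state
states <- data.frame(code  = c("NY", "NJ"),
                     cases = c(195031, 64584))

fig <- plot_ly(states,
               type = "choropleth", locationmode = "USA-states",
               locations = ~code, z = ~cases) %>%
  layout(geo = list(scope = "usa"),
         title = "Cumulative cases by state")
fig
```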
Below, you can see two tables presenting the above information in tabular form. New York and New Jersey have the highest case counts. You may have noticed that these states are white in the above map - this is because their numbers were so high that not even converting the data into percentiles accurately depicted the range of cases in the US.
Meaning, because NY and NJ have such a high number of reported cases, it was very difficult to find an appropriate way to display them with the rest of the data. I'm still experimenting with my plotly code to combine two different maps into one so I can present NY and NJ in a grey color with a separate legend.
| State | Cumulative Cases | Rank |
|---|---|---|
| New York | 195031 | 1 |
| New Jersey | 64584 | 2 |
| Massachusetts | 26867 | 3 |
| Michigan | 25490 | 4 |
| California | 24334 | 5 |
| Pennsylvania | 24295 | 6 |
| Illinois | 22028 | 7 |
| Louisiana | 21016 | 8 |
| Florida | 21011 | 9 |
| Texas | 14488 | 10 |
Below, you can see the states with the lowest reported coronavirus cases as of 13 April.
| State | Cumulative Cases | Rank |
|---|---|---|
| Northern Mariana Islands | 11 | 1 |
| Virgin Islands | 51 | 2 |
| Alaska | 275 | 3 |
| Wyoming | 275 | 4 |
| North Dakota | 331 | 5 |
| Montana | 394 | 6 |
| Hawaii | 502 | 7 |
| West Virginia | 638 | 8 |
| Maine | 698 | 9 |
| Guam | 716 | 10 |
In this next map, you can see the number of cases reported by county. As in the previous map, the lighter the color, the higher the number of cases reported in the county. The main difference between this map and the previous one, besides the difference in geographical units, is that this map presents the range in percentiles.
This can often be a practical solution when you're dealing with data sets with a wide range.
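One simple way to bin a wide-ranging variable into percentile groups is `cut()` with quantile breaks (the case counts below are made up for illustration):

```r
# Hypothetical county case counts spanning several orders of magnitude
cases <- c(5, 20, 80, 400, 2000, 15000, 195031)

# Bin into quartile-based groups so a color scale isn't dominated by outliers
groups <- cut(cases,
              breaks = quantile(cases, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE,
              labels = c("0-25th", "25-50th", "50-75th", "75-100th"))
groups
```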
Modeling Coronavirus Cases
There are always multiple angles to take when analyzing a data set, especially one with so many variables. In the first part of this section, I will use two models to make predictions for the rest of the month’s coronavirus cases:
- Logistic Growth Model
- Log-Linear Model
In the second part, I try to see the relationship between socio-economic factors and cumulative cases as of the 13th of April.
Logistic Growth Model
Logistic growth models are used when there is rapid, increasing growth in the beginning stages of a phenomenon, but growth then slows at a later stage as you start reaching some “carrying capacity.” I suggest checking out this page on Duke University’s website for a simple explanation and example of logistic growth models, as well as the video some of you may have already seen dealing specifically with the pandemic.
Following the example of the code for the logistic growth model by this blog, I took the following as my model.
\[
Cases_{day(t)} = \dfrac{asymptote}{1 + e^{(midpoint - t)/scale}}
\]
I used a self-starting function called SSlogis, together with a nonlinear (weighted) least-squares (or NLS) model fit with the nls function in R. Documentation and more information can be found on R Documentation. My model estimated the asymptote, midpoint and scale of the logistic growth model and looks like this:
\[
Cumulative \; Cases_{day(t)} = \dfrac{\phi_{1}}{1 + e^{((\phi_{2} - t)/\phi_{3})}}
\]
The self-starting nls logistic model creates an initial estimate of the \(\phi\) parameters.
As we can see from the summary table, all parameters are statistically significant (p < 0.001).
##
## Formula: cumulative.cases ~ SSlogis(daysince, phi1, phi2, phi3)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## phi1 698495.7505 9421.9639 74.14 <0.0000000000000002 ***
## phi2 76.2966 0.1705 447.45 <0.0000000000000002 ***
## phi3 5.1101 0.0754 67.77 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4360 on 81 degrees of freedom
##
## Number of iterations to convergence: 0
## Achieved convergence tolerance: 0.000002029
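As a self-contained sketch of this kind of fit, here is the same SSlogis/nls combination run on simulated data (the real county series isn't reproduced here, so the numbers are made up):

```r
# Simulated logistic-shaped series standing in for the real cumulative counts
set.seed(1)
sim <- data.frame(daysince = 1:80)
sim$cumulative.cases <- 700000 / (1 + exp((50 - sim$daysince) / 5)) +
  rnorm(80, sd = 1000)

# SSlogis supplies starting values: phi1 = asymptote, phi2 = midpoint, phi3 = scale
fit <- nls(cumulative.cases ~ SSlogis(daysince, phi1, phi2, phi3), data = sim)
coef(fit)

# Interpolation/extrapolation is then just predict() on new day values
predict(fit, newdata = data.frame(daysince = 81:90))
```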
Next, we want to plot our model along with the actual data points. I managed to graph my model thanks to the code on this blog.
I chose not to make this plot interactive, as the code for showing the equation clashed with plotly and I figured it would make my life easier to just show a static version. As you can see, the actual data points are in blue while the model fit is displayed by the red curve. The model fits the data almost perfectly.
Using this model, I predicted the cases up until the 13th of April. In other words, I tested my model by making an interpolation with all the dates available. Looking at the graph, the model performs very well for interpolation.
The next step, naturally, is to perform an extrapolation until April 21st. Because the NYT data set wasn’t updated through the 21st, I imputed values from Worldometer’s infamous COVID-19 live tracker. While it is helpful to keep in mind that the NYT and Worldometer data sets can vary, it is clear that the model underpredicts the actual cases.
Compared to the actual growth, the model seems to have predicted parameters that forecast a decrease in growth. In reality, these cases show no signs of slowing down.
Exponential Growth Model
In this part, I will try using an exponential growth model to predict coronavirus cases instead. This model is inspired by Towards Data Science’s page on modeling exponential growth of coronavirus cases.
An exponential growth model is pretty self-explanatory: it aims to model a variable with exponential growth. The equation of my model is as follows:
\[
x(t) = x_{0}*b^t
\]
Where:
- \(x(t)\) is the number of cases at time \(t\)
- \(x_{0}\) is the initial number of cases
- \(b\) is the growth factor, which is the number of people that are infected by each sick person
In order to coerce this formula into a linear expression, we simply take the logarithm of both sides, so that the equation becomes:
\[
log \; x(t) = log(x_{0}) + log(b) * t
\]
As you can see from the summary table, this linear model has a high \(R^2\) combined with parameters that are both statistically significant (p < 0.001).
##
## Call:
## lm(formula = caseslog ~ daysince, data = total.log.cases.byday)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6377 -0.7885 0.1421 0.8513 1.5553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.975068 0.204840 -4.76 0.00000821 ***
## daysince 0.171532 0.004186 40.97 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9303 on 82 degrees of freedom
## Multiple R-squared: 0.9534, Adjusted R-squared: 0.9529
## F-statistic: 1679 on 1 and 82 DF, p-value: < 0.00000000000000022
The next step is to use this model to predict the number of cases up to the 20th of April and see if it performs better than the logistic growth model.
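Prediction on the original scale just means back-transforming with `exp()`. A sketch on simulated data, reusing the variable names from the summary above (the series itself is made up):

```r
# Toy exponential series: x(t) = 100 * 1.2^t, with noise added on the log scale
set.seed(42)
total.log.cases.byday <- data.frame(daysince = 1:60)
total.log.cases.byday$caseslog <- log(100) +
  log(1.2) * total.log.cases.byday$daysince + rnorm(60, sd = 0.1)

fit <- lm(caseslog ~ daysince, data = total.log.cases.byday)

# The slope estimates log(b); back-transform predictions with exp()
future <- data.frame(daysince = 61:67)
predicted.cases <- exp(predict(fit, newdata = future))
```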
Plotting the projection and the real data points together, we can see that this model overpredicts the number of cases by a lot. As a result, neither of these models fits quite right.
Needless to say, forecasting is already difficult, but doing so from an epidemiological standpoint is even more so - at least in my opinion. At this point, I would probably go back to the SSlogis model and play around with the parameters, making hypotheses about what they will be.
If you’re interested in looking at some projections that have been done for your country, check out Health Data’s projections until August.
Socio-Economic Model
What is interesting about any model is trying to tease out whether or not any other factors play a role in having a high number of coronavirus cases or deaths. While coronavirus cases are still growing right now, and therefore require models specific to pandemics, there will come a time when they stop growing.
In this socio-economic model, I’m curious to know if there are any shared characteristics between counties with low and high coronavirus cases. For this part, I will be treating the cumulative cases on the 13th of April as the final amount of coronavirus cases.
According to the World Health Organization, older people and people who have pre-existing medical conditions - asthma, diabetes, heart disease - are more at risk of becoming severely ill with COVID-19.
Health isn’t the only factor involved in the spread of coronavirus. According to the L.A. Times, “black people continue to see the highest COVID-19 death rate in L.A. County.” Looking at the death rates per race as of the 26th of April, we can see a significant difference: 13 deaths per 100,000 for black people, 9.5 for Latinos, 7.5 for Asian people and 5.5 for white people.
Of course, being a certain race doesn’t make you more or less susceptible to the novel virus. Certain socio-economic factors can be the underlying cause of these disparate death rates, such as having health insurance, household income, the number of airports in your county, access to hospitals, etc.
There are many different ways you can play with this data. You can employ both parametric and non-parametric tests:
- Cluster analysis
- Factor analysis
- Linear regression
- Quantile regression
Here, I’ve chosen to perform only the last two, for simplicity. I won’t go into significant detail; this is more a demonstration of the types of analysis you can do.
Simple Linear Regression
First, I check the spread of the variables, since linear regression assumes normality (strictly speaking, of the residuals).
| Correlation Coefficient with Cumulative Cases | Variable Name |
|---|---|
| 0.3164380 | population |
| 0.2814870 | hospitals |
| 0.2760621 | med_home_val_000s |
| -0.1832606 | urban_rural |
| 0.1620762 | airport |
| 0.1475656 | perc_sev_housing |
| 0.1416452 | med_hh_inc_000s |
| 0.1346236 | gini |
| -0.0646698 | diabetes_perc |
| -0.0619975 | cardio_deathrt_per_hundthou_ov35 |
| -0.0578892 | pop65up_perc |
| -0.0549134 | prim_phys |
| -0.0479434 | perc_wo_ins_und65 |
| 0.0432933 | airpm2.5conc |
| -0.0316094 | perc_poverty |
| 0.0307291 | femalehd_perc |
| -0.0287720 | hyperten_deathrt_perhundthou_over35 |
| 0.0267720 | cost_percap_medicare_hd |
| -0.0243113 | nohsdip_perc |
| -0.0169363 | foodstmp_perc |
| -0.0144535 | unemp_rt |
| 0.0026920 | hd_among_ins_perc |
| Variable | Skewness |
|---|---|
| cases | 38.0928215 |
| population | 13.5106668 |
| hospitals | 12.7503638 |
| airport | 5.7501017 |
| prim_phys | 5.0997760 |
| med_home_val_000s | 3.4016067 |
| unemp_rt | 1.9853510 |
| hyperten_deathrt_perhundthou_over35 | 1.4120476 |
| med_hh_inc_000s | 1.3876523 |
| perc_sev_housing | 1.3632143 |
| femalehd_perc | 1.2155457 |
| perc_poverty | 1.0037907 |
| foodstmp_perc | 0.9494621 |
| pop65up_perc | 0.9247539 |
| nohsdip_perc | 0.9005759 |
| perc_wo_ins_und65 | 0.7816133 |
| cardio_deathrt_per_hundthou_ov35 | 0.7057460 |
| cost_percap_medicare_hd | 0.6428656 |
| airpm2.5conc | -0.4747358 |
| gini | 0.3984492 |
| diabetes_perc | 0.3452524 |
| hd_among_ins_perc | 0.0635327 |
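Skewness values like those above can be computed by hand (the third standardized moment), so no extra package is needed. The variable here is simulated for illustration:

```r
# Sample skewness: average cubed deviation divided by sd cubed
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# A heavily right-skewed variable, similar in shape to county case counts
set.seed(7)
cases <- rlnorm(1000, meanlog = 3, sdlog = 1)

skewness(cases)       # strongly positive: heavy right skew
skewness(log(cases))  # near zero: roughly symmetric after a log transform
```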
You can go ahead and explore the correlations yourself. Because a couple of the variables are highly skewed, I suggest we apply a logarithmic transformation to them. The new correlation table is below.
| Correlation Coefficient with Cumulative Cases | Log of Variable |
|---|---|
| -0.6000246 | urban_rural |
| 0.5435196 | airport |
| 0.5258378 | population |
| 0.4903658 | hospitals |
| 0.4462501 | med_home_val_000s |
| 0.4161428 | med_hh_inc_000s |
| 0.3887019 | perc_sev_housing |
| -0.3862815 | pop65up_perc |
| 0.2716570 | airpm2.5conc |
| 0.2533475 | femalehd_perc |
| -0.2281880 | prim_phys |
| -0.1713834 | cardio_deathrt_per_hundthou_ov35 |
| 0.1651199 | gini |
| -0.1546424 | perc_poverty |
| -0.1540688 | nohsdip_perc |
| -0.1282357 | perc_wo_ins_und65 |
| -0.0572528 | hyperten_deathrt_perhundthou_over35 |
| -0.0451831 | diabetes_perc |
| -0.0316169 | foodstmp_perc |
| -0.0268021 | hd_among_ins_perc |
| -0.0267021 | unemp_rt |
| 0.0232935 | cost_percap_medicare_hd |
Now, you can see that all the variables are either moderately skewed or approximately symmetric. While I am worried about multicollinearity, this is something that I’ll check afterwards.
| Variable | Skewness |
|---|---|
| foodstmp_perc | 0.9494621 |
| pop65up_perc | 0.9247539 |
| nohsdip_perc | 0.9005759 |
| med_home_val_000s | 0.7990006 |
| perc_wo_ins_und65 | 0.7816133 |
| cases | 0.7094941 |
| cardio_deathrt_per_hundthou_ov35 | 0.7057460 |
| prim_phys | 0.6748102 |
| cost_percap_medicare_hd | 0.6428656 |
| population | 0.5398575 |
| airpm2.5conc | -0.4747358 |
| gini | 0.3984492 |
| med_hh_inc_000s | 0.3965121 |
| diabetes_perc | 0.3452524 |
| hyperten_deathrt_perhundthou_over35 | -0.2749007 |
| perc_poverty | -0.2230960 |
| perc_sev_housing | -0.1926100 |
| femalehd_perc | -0.1646569 |
| unemp_rt | 0.0869987 |
| hd_among_ins_perc | 0.0635327 |
| hospitals | NaN |
| airport | NaN |
##
## Call:
## lm(formula = cases ~ ., data = linfit1data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1270 -0.6139 -0.0215 0.5718 3.4639
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -20.24369815 1.51148432 -13.393
## population 0.85059592 0.03604451 23.598
## hospitals 0.12687856 0.05620829 2.257
## airport 0.02510218 0.03780417 0.664
## prim_phys -0.02437483 0.04126718 -0.591
## med_home_val_000s 0.22044494 0.10788088 2.043
## unemp_rt -0.16970540 0.08270335 -2.052
## hyperten_deathrt_perhundthou_over35 -0.25602421 0.05248973 -4.878
## med_hh_inc_000s 1.95054027 0.31004955 6.291
## perc_sev_housing 0.23464258 0.11911015 1.970
## femalehd_perc 1.03526165 0.11319939 9.145
## perc_poverty 0.38124217 0.16866368 2.260
## foodstmp_perc -0.00300026 0.00648071 -0.463
## pop65up_perc 0.01046893 0.00635761 1.647
## nohsdip_perc -0.00566500 0.00568505 -0.996
## perc_wo_ins_und65 0.00011427 0.00577856 0.020
## cardio_deathrt_per_hundthou_ov35 0.00099043 0.00033915 2.920
## cost_percap_medicare_hd 0.00001291 0.00001015 1.273
## gini 5.02035562 0.78481070 6.397
## diabetes_perc 0.03010349 0.01517702 1.983
## hd_among_ins_perc -0.00600691 0.00564833 -1.063
## airpm2.5conc 0.00705227 0.01482253 0.476
## urban_rural2 -0.06928934 0.15730728 -0.440
## urban_rural3 -0.30601285 0.15132883 -2.022
## urban_rural4 -0.49783790 0.16160910 -3.081
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## population < 0.0000000000000002 ***
## hospitals 0.02408 *
## airport 0.50675
## prim_phys 0.55481
## med_home_val_000s 0.04112 *
## unemp_rt 0.04028 *
## hyperten_deathrt_perhundthou_over35 0.000001144641 ***
## med_hh_inc_000s 0.000000000374 ***
## perc_sev_housing 0.04896 *
## femalehd_perc < 0.0000000000000002 ***
## perc_poverty 0.02389 *
## foodstmp_perc 0.64344
## pop65up_perc 0.09976 .
## nohsdip_perc 0.31912
## perc_wo_ins_und65 0.98423
## cardio_deathrt_per_hundthou_ov35 0.00353 **
## cost_percap_medicare_hd 0.20331
## gini 0.000000000190 ***
## diabetes_perc 0.04743 *
## hd_among_ins_perc 0.28767
## airpm2.5conc 0.63427
## urban_rural2 0.65964
## urban_rural3 0.04327 *
## urban_rural4 0.00209 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9616 on 2383 degrees of freedom
## (254 observations deleted due to missingness)
## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7463
## F-statistic: 296.1 on 24 and 2383 DF, p-value: < 0.00000000000000022
Here we see that the model has a great \(R^2\) value, although some of the variables are not statistically significant. Because we might have a problem with multicollinearity, we should check the VIF, or variance inflation factor.
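The GVIF table below is the kind of output you get from the car package's vif() function. As a sketch of what it computes, here is VIF by hand, illustrated on a built-in data set rather than the county data:

```r
# VIF for each predictor: regress it on the other predictors; VIF = 1 / (1 - R^2)
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# mtcars stands in for the county data; values above ~5-10 flag collinearity
vif_by_hand(mtcars[, c("disp", "hp", "wt")])
```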
## GVIF Df GVIF^(1/(2*Df)) vars
## 1 14.677516 1 3.831125 med_hh_inc_000s
## 2 11.145020 1 3.338416 perc_poverty
## 3 5.904324 1 2.429881 med_home_val_000s
## 4 5.589864 1 2.364289 population
## 5 5.302795 1 2.302780 foodstmp_perc
## 6 3.795620 1 1.948235 femalehd_perc
## 7 3.156536 1 1.776664 nohsdip_perc
## 8 3.153826 3 1.210987 urban_rural
## 9 3.067385 1 1.751395 diabetes_perc
## 10 3.032982 1 1.741546 cardio_deathrt_per_hundthou_ov35
## 11 2.834033 1 1.683459 perc_sev_housing
## 12 2.744162 1 1.656551 hospitals
## 13 2.383997 1 1.544020 hd_among_ins_perc
## 14 2.196092 1 1.481922 airport
## 15 1.946701 1 1.395242 airpm2.5conc
## 16 1.911237 1 1.382475 perc_wo_ins_und65
## 17 1.873603 1 1.368796 gini
## 18 1.861858 1 1.364499 unemp_rt
## 19 1.854786 1 1.361905 pop65up_perc
## 20 1.679301 1 1.295878 prim_phys
## 21 1.489281 1 1.220361 cost_percap_medicare_hd
## 22 1.250564 1 1.118286 hyperten_deathrt_perhundthou_over35
Looking at the VIFs, we can see that some variables, like population or hospitals, have high VIFs. This makes sense, as they each have high correlation coefficients with one or more other variables in the data set. You can check out the correlation matrix below.
There are many ways to deal with multicollinearity; however, I chose simply to take some of these highly correlated variables out and run a second regression.
|  | cases | med_hh_inc_000s | pop65up_perc | hospitals | prim_phys | perc_wo_ins_und65 | cost_percap_medicare_hd | hd_among_ins_perc | airport |
|---|---|---|---|---|---|---|---|---|---|
| cases | 1.0000000 | 0.1416452 | -0.0578892 | 0.2814870 | -0.0549134 | -0.0479434 | 0.0267720 | 0.0026920 | 0.1620762 |
| med_hh_inc_000s | 0.1416452 | 1.0000000 | -0.2823681 | 0.1968010 | -0.2001174 | -0.3854992 | -0.2680835 | -0.3598292 | 0.2847773 |
| pop65up_perc | -0.0578892 | -0.2823681 | 1.0000000 | -0.1860082 | 0.0248048 | -0.0279625 | -0.1437373 | -0.0548488 | -0.1907273 |
| hospitals | 0.2814870 | 0.1968010 | -0.1860082 | 1.0000000 | -0.1975375 | -0.0647762 | 0.0971480 | -0.0448295 | 0.7733669 |
| prim_phys | -0.0549134 | -0.2001174 | 0.0248048 | -0.1975375 | 1.0000000 | 0.2015667 | 0.1390482 | 0.2031435 | -0.2019914 |
| perc_wo_ins_und65 | -0.0479434 | -0.3854992 | -0.0279625 | -0.0647762 | 0.2015667 | 1.0000000 | 0.2725076 | 0.2446784 | -0.0909493 |
| cost_percap_medicare_hd | 0.0267720 | -0.2680835 | -0.1437373 | 0.0971480 | 0.1390482 | 0.2725076 | 1.0000000 | 0.3918527 | 0.0232562 |
| hd_among_ins_perc | 0.0026920 | -0.3598292 | -0.0548488 | -0.0448295 | 0.2031435 | 0.2446784 | 0.3918527 | 1.0000000 | -0.0547139 |
| airport | 0.1620762 | 0.2847773 | -0.1907273 | 0.7733669 | -0.2019914 | -0.0909493 | 0.0232562 | -0.0547139 | 1.0000000 |
Here, we can see a table with the new correlation matrix with the variables I chose to keep.
##
## Call:
## lm(formula = log(cases) ~ ., data = linfit2data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9285 -0.9419 0.0232 0.9310 6.4516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.846370562 0.424399829 1.994 0.0462
## med_hh_inc_000s 0.038925510 0.002757431 14.117 < 0.0000000000000002
## pop65up_perc -0.096738731 0.007521974 -12.861 < 0.0000000000000002
## hospitals 0.093817992 0.016516824 5.680 0.000000015075
## prim_phys -0.077948842 0.012504279 -6.234 0.000000000536
## perc_wo_ins_und65 -0.002650449 0.006907690 -0.384 0.7012
## cost_percap_medicare_hd 0.000002291 0.000014103 0.162 0.8709
## hd_among_ins_perc 0.035774401 0.006208671 5.762 0.000000009371
## airport 0.159235175 0.012889401 12.354 < 0.0000000000000002
##
## (Intercept) *
## med_hh_inc_000s ***
## pop65up_perc ***
## hospitals ***
## prim_phys ***
## perc_wo_ins_und65
## cost_percap_medicare_hd
## hd_among_ins_perc ***
## airport ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.425 on 2412 degrees of freedom
## (241 observations deleted due to missingness)
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.4425
## F-statistic: 241.1 on 8 and 2412 DF, p-value: < 0.00000000000000022
This second linear regression shows an okay \(R^2\), significantly lower than our previous model, and some variables that are not statistically significant.
Quantile Regression
The best way of explaining quantile regression is to quote the experts. Authors Cook and Manning, in their paper entitled “Thinking beyond the mean” for the Shanghai Archives of Psychiatry, wrote:
"Quantile regression is not a regression estimated on a quantile, or subsample of data as the name may suggest….
In OLS regression, the goal is to minimize the distances between the values predicted by the regression line and the observed values. In contrast, quantile regression differentially weights the distances between the values predicted by the regression line and the observed values, then tries to minimize the weighted distances. Alternatively, this can be viewed as weighting the distances between the values predicted by the regression line and the observed values below the line (negative residuals) by 0.25, and weighting the distances between the values predicted by the regression line and the observed values above the line (positive residuals) by 0.75. Doing so ensures that minimization occurs when 75 percent of the residuals are negative."
Because quantile regression estimates parameters at different quantiles, the assumption of normality is not required. Because of this, I think it will be interesting to see whether socio-economic and demographic factors vary across quantiles of cumulative cases. In other words, do counties with the highest coronavirus cases exhibit different behavior than those with the lowest cases for the same variables?

If you’ve never seen this plot before, don’t be scared - it’s actually quite simple. There are some variables with significant differences between the lower and upper percentiles, like the female head of household percentage, for example. For each percentage point of increase in the female head of household variable, there are about 0.7 more cases at the .1 quantile. At the .9 quantile, this number is about 2.
Check out some of the other variables and see what interesting relationships you can find. You can also perform an ANOVA test to test whether or not the coefficients are statistically different at different deciles. Because we are doing a joint test of equality of slopes, which is a test of equality on all the coefficients, you can think of it as a “slope homogeneity test.”
\[
H_{0}: the \; slopes \; are \; homogeneous
\] \[
H_{1}: not \; H_{0}
\]
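A sketch of how such a joint test can be run with the quantreg package, illustrated on a built-in data set (the actual model uses the full county formula shown in the output below):

```r
library(quantreg)

# Quantile fits at the .1 and .9 deciles; mtcars stands in for the county data
fit_lo <- rq(mpg ~ wt + hp, tau = 0.1, data = mtcars)
fit_hi <- rq(mpg ~ wt + hp, tau = 0.9, data = mtcars)

# anova.rq on fits that differ only in tau gives the joint test of equal slopes
anova(fit_lo, fit_hi, joint = TRUE)
```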
## Quantile Regression Analysis of Deviance Table
##
## Model: cases ~ population + hospitals + airport + prim_phys + med_home_val_000s + unemp_rt + hyperten_deathrt_perhundthou_over35 + med_hh_inc_000s + perc_sev_housing + femalehd_perc + perc_poverty + foodstmp_perc + pop65up_perc + nohsdip_perc + perc_wo_ins_und65 + cardio_deathrt_per_hundthou_ov35 + cost_percap_medicare_hd + gini + diabetes_perc + hd_among_ins_perc + airpm2.5conc + urban_rural
## Joint Test of Equality of Slopes: tau in { 0.1 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 24 4792 6.3954 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value below 0.001, we reject the null hypothesis and retain the alternative. This supports the idea that counties in the higher deciles of coronavirus cases have different parameters for socio-economic and demographic variables than those in the lower deciles.
Conclusion
Analyses are like points of view - you may see a data set in an entirely different light than even your closest friend does. My analysis is by no means complete - if you noticed something really important missing, go ahead and download the data for your region and add onto it from your own unique perspective!
In conclusion, the things I went over were:
- Exploratory analysis in R
- Interactive graphs and maps in R
- Modeling exponential growth
Please keep in mind that you should always go through your variables thoroughly, using both descriptive statistics and visualizations, before doing any type of analysis. I held back here because I wanted to show what was possible without going into too much detail - if you’re like me, you could probably deal with the same data set forever and still find other ways to examine/present your findings.
Some ways you can add to or improve this type of analysis are:
- Explore death rates for coronavirus
- Create an interactive graph for the infamous “curve” everyone speaks about flattening
- Talk about other socio-economic and demographic aspects related to the coronavirus, such as this one
- Perform a more complete logistic growth, linear or quantile regression
And here are some more references to get you going with visualizations in R with some examples of the pages I used.
And of course all the other helpful ones like Stack Overflow and R Graph Gallery.
If you have any questions, feel free to let me know and thank you for your time!